Random Forests belong to the class of ensemble methods. The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability/robustness over a single estimator.

There are two families of ensemble methods:

  • Averaging methods: build several estimators independently and then average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: Bagging methods, Forests of randomized trees.

  • Boosting methods: base estimators are built sequentially, and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting, ... (a small sketch follows below).
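As a quick, hedged illustration of the boosting family (not part of the original notes; the iris dataset is just a stand-in), AdaBoostClassifier fits a sequence of weak learners, each focusing on the samples its predecessors misclassified:

In [ ]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# Fit 100 sequential weak learners; each round reweights the training samples
# so that previously misclassified points get more attention.
X_iris, y_iris = load_iris(return_X_y=True)
scores = cross_val_score(AdaBoostClassifier(n_estimators=100), X_iris, y_iris, cv=5)
scores.mean()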

Bagging meta-estimator

  • Build several instances of a black-box estimator on random subsets of the original training set, then aggregate their individual predictions to form a final prediction.

  • In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator, taking as input a user-specified base estimator along with parameters specifying the strategy to draw random subsets:


In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Each base k-NN classifier is trained on a random 50% of the samples and 50% of the features.
bagging = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
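A possible follow-up, not in the original notes: once configured, the meta-estimator behaves like any other classifier. The synthetic dataset below is purely illustrative.

In [ ]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; fit the bagging ensemble and score it on a held-out split.
X_syn, y_syn = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_syn, y_syn, random_state=0)
bagging.fit(X_train, y_train)
bagging.score(X_test, y_test)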

Forests of randomized trees

The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both follow a perturb-and-combine approach: a diverse set of classifiers is created by introducing randomness in the classifier construction, and the prediction of the ensemble is given as the averaged prediction of the individual classifiers.


In [3]:
from sklearn.ensemble import RandomForestClassifier

# A toy two-sample dataset; each of the 10 trees is fit on a bootstrap sample of it.
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)

Random Forests

Each tree in the ensemble is built from a sample drawn with replacement (a bootstrap sample) from the training set.

The scikit-learn implementation combines classifiers by averaging their probabilistic predictions, instead of letting each classifier vote for a single class.
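A small sketch of what this averaging means in practice (an illustration added here, not part of the original notes): the forest's predict_proba should match the mean of the per-tree probabilities exposed through estimators_, using the clf and X fitted above.

In [ ]:
import numpy as np

# The ensemble probability is the average of the individual trees' class probabilities.
forest_proba = clf.predict_proba(X)
tree_probas = [tree.predict_proba(X) for tree in clf.estimators_]
np.allclose(forest_proba, np.mean(tree_probas, axis=0))   # expected: True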

Extremely Randomized Trees

Randomness goes one step further in the way splits are computed.

As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule.

This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
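A hedged comparison sketch of the two forest variants (the synthetic data and any resulting scores are illustrative assumptions, not results from the original notes):

In [ ]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Compare the two forest variants with 5-fold cross-validation on synthetic data.
X_cmp, y_cmp = make_classification(n_samples=1000, n_features=20, random_state=0)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Model(n_estimators=100, random_state=0), X_cmp, y_cmp, cv=5)
    print(Model.__name__, scores.mean())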

Parameters

The main parameters to adjust are n_estimators and max_features.

n_estimators is the number of trees in the forest. The larger, the better, but also the longer the computation will take; in addition, results stop getting significantly better beyond a critical number of trees.

max_features is the size of the random subsets of features to consider when splitting a node. The lower it is, the greater the reduction of variance, but also the greater the increase in bias.

Empirically good default values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification tasks.

Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., fully developed trees).

The best parameter values should always be cross-validated.

In random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False).
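One way to cross-validate these parameters, sketched with an assumed grid (the values and synthetic data below are illustrative, not recommendations from the original notes):

In [ ]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid-search n_estimators and max_features with 5-fold cross-validation.
X_cv, y_cv = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {"n_estimators": [50, 100, 200], "max_features": ["sqrt", None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_cv, y_cv)
search.best_params_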

Parallelization

The module supports parallel construction of the trees and parallel computation of the predictions through the n_jobs parameter.

If n_jobs=k then computations are partitioned into k jobs, and run on k cores of the machine.

If n_jobs=-1 then all cores available on the machine are used.
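A short illustrative sketch of n_jobs (the synthetic data is an assumption, not from the original notes):

In [ ]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 parallelizes both tree construction and prediction over all available cores.
X_par, y_par = make_classification(n_samples=1000, n_features=20, random_state=0)
clf_parallel = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf_parallel.fit(X_par, y_par)
clf_parallel.predict(X_par[:5])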

